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The MEG inverse problem refers to the reconstruction of the 
, neural activity of the brain from magnetoencephalography (MEG) 

measurements. We propose a two-way regularization (TWR) method 
, |' to solve the MEG inverse problem under the assumptions that only 

0^ ■ a small number of locations in space are responsible for the mea- 

' sured signals (focality), and each source time course is smooth in time 

(smoothness). The focality and smoothness of the reconstructed sig- 
nals are ensured respectively by imposing a sparsity-inducing penalty 
and a roughness penalty in the data fitting criterion. A two-stage al- 
gorithm is developed for fast computation, where a raw estimate of 
the source time course is obtained in the first stage and then refined in 
" the second stage by the two-way regularization. The proposed method 

^ , is shown to be effective on both synthetic and real-world examples. 

m. 

1. Introduction. Magnetoencephalography (MEG) is a noninvasive neu- 
. rophysiological technique that measures the magnetic field generated by neu- 

CTN I ral activity of the brain using a collection of sensors outside the scalp [Pa- 

panicolaou (1995)]. When information is being processed at some regions of 
the brain, small currents will flow in the neural system, producing a small 
electric field, which in turn produces an orthogonally oriented small mag- 
• rH ■ netic field according to Maxwell's Equations. The MEG inverse problem 

^ | refers to recovering neural activity by means of measurements of the mag- 

netic field. The neural activities are usually represented by magnetic dipoles, 
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which are closed circulations of electric currents, that is, loops with some 
constant current flowing through. Each dipole has a position, an orienta- 
tion, and a magnitude. The inverse problem then becomes determining the 
position, orientation, and magnitude (or amplitude) of the dipoles. 

One challenge of the MEG inverse problem is that it does not have 
a unique solution and so it is ill-posed [von Helmholtz (1853); Nunez (1981); 
Sarvas (1987)]. As early as in the 19th century, von Helmholtz demonstrated 
theoretically that general inverse problems, such as those aiming at identify- 
ing the sources of electromagnetic fields outside a volume conductor, have an 
infinite number of solutions [von Helmholtz (1853)]. Hence, to derive a prac- 
tically meaningful solution from the infinitely many mathematically correct 
solutions, one has to introduce constraints to the solution and/or use prior 
knowledge about the brain activity. 

Existing approaches for the MEG inverse problem can be grouped into 
two major classes that differ in how they impose constraints on the source 
signals. Within the first class, the dipole fitting [Scherg and Von Cramon 
(1986); Hamalainen et al. (1993); Yamazaki et al. (2000); Jun et al. (2005)] 
and scanning methods [Sorrentino et al. (2009); Schmidt (1986); Mosher, 
Lewis and Leahy (1992); Veen and Buckley (1988); Van Veen et al. (1997); 
Dogandzic and Nehorai (2000)] assume that there exist a limited number 
of dipoles as point sources of the magnetic field in the brain. By constrain- 
ing the number of sources, the locations of these dipoles are estimated by 
least squares fitting [Lu and Kaufman (2003)] or iterative computing [Bail- 
let et al. (2001)]. Dipole orientations and amplitudes can be effectively esti- 
mated within these locations. However, estimating the source locations in- 
volves solving a difficult nonlinear optimization problem which has multiple 
local optima [Darvas et al. (2004)]. 

Our proposed method belongs to the second class, which contains vari- 
ous imaging methods. Different from the first class, imaging methods assume 
that there are a large number of potential dipole locations evenly distributed 
all over the cortex. By dividing the cortical region into a fine grid and at- 
taching a dipole at each grid, imaging methods model the orientations and 
magnitudes for all the potential dipoles simultaneously. Dipoles with nonzero 
magnitudes are identified as the source dipoles. Imaging methods are based 
on the theory that the primary sources can be represented as linear com- 
binations of neuron activities [Barlow (1994)]. One can express the inverse 
problem using a linear model 



where Y is an n x s matrix containing MEG time courses measured by n 
sensors at s time points, recording the amplitudes of the magnetic field. 
Without loss of generality, it is assumed that the s measurements for each 
time course are sampled at the same evenly-spaced time points. The known 
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n x p design matrix X links the source signals to the sensor measurements, 
and is computed using a boundary element model prior to application of 
Model (1) [Mosher, Leahy and Lewis (1999)]. The px s matrix B represents 
the unknown dipole activities in the form of p unobservable source time 
courses. The n x s matrix E contains some additive noise. The amplitudes 
and orientations of the signal for each dipole at a time point can be decom- 
posed into three components in the x, y, z coordinate system. Therefore, p 
represents the total number of the dipole components, and it is three times 
that of the number of grid cells. In a typical MEG study, s is usually from 
a few hundred to a few thousand, n is a few hundred, but p is as large as 
over 10,000, and so p > n. 

Defining the matrix Frobenius norm as ||B||^ = \J tr(B T B). To recover B, 
one can solve a penalized least squares problem 

(2) min{||Y-XB||| + Apen(B)}, 

where pen(-) is a penalty function that promotes certain desirable properties 
on B. 

In the literature of MEG source reconstruction, spatial focality and tem- 
poral smoothness are two valid assumptions. That is, the source signals 
are smooth in time, and only a small number of compact areas are re- 
sponsible for the recordings [Bolstad, Veen and Nowak (2009)]. Many of 
the imaging methods focus on either the first assumption or the second. 
Earlier methods using the smoothness assumption usually adopt the L2- 
norm penalty, pen(B) = ||WB||^ for certain weighting matrix W. The sim- 
plest such method is the minimum norm estimate (MNE) [Hamalainen and 
Ilmoniemi (1994)] which uses W = I. The LORETA methods [Pascual- 
Marqui, Michel and Lehmann (1994); Pascual-Marqui (2002)] set W to be 
the discrete spatial Laplacian operator. Two advantages of the I/2-penalty 
based methods are the computational efficiency and the less-spiky property 
in the time domain. Nevertheless, the L2-penalty lowers the spatial resolu- 
tion and causes the well-known "blurring effect" in the spatial domain. Uti- 
lizing the ^-penalty, the FOCUSS method [Gorodnitsky and Rao (1997)] 
reduces the blurring effect by reinforcing the strong signals while weaken- 
ing the weak ones using an iterative algorithm to update W. However, it is 
noticed that FOCUSS is very sensitive to noise [Ou, Hamalainen and Got- 
land (2009)]. Many hierarchical Bayesian approaches induce the temporal 
smoothness by employing smoothing priors which penalize discontinuities 
[see, e.g., Baillet and Garnero (1997); Daunizeau et al. (2006); Nummenmaa 
et al. (2007a)]. 

An alternative penalty is the Li-norm, pen(B) = |B| = Y2j \bij\-> which 
promotes the focality of the recovered signals. Related work includes the 
minimum current estimate (MCE) [Matsuura and Okabe (1995); Uutela, 
Hamalainen and Somersalo (1999); Lin et al. (2006)] and the sparse source 
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imaging method [Ding and He (2008)]. In contrast to the ^-penalty, the 
Li-penalty causes "spiky" discontinuities of the recovered signals in both 
temporal and spatial domains. Bayesian methods developed by Baillet and 
Garnero (1997), Friston et al. (2008), Nummenmaa et al. (2007b) take into 
account the spatial focality by employing anatomic sparse priors. However, 
these methods have similar problems as methods based on the L\ -penalty. 

To prevent the spiky property from the Li-penalty and the blurry prop- 
erty from the -^-penalty, some L/-norm methods with < / < 1 and 1 < / < 2 
have been introduced [Auranen et al. (2005); Jeffs, Leahy and Singh (1987)]. 
However, the optimization problems associated with L/-penalties are more 
difficult to solve than with L\ and L2 penalties. 

More recently, some spatio-temporal regularization methods have been 
proposed, which take into account both the smoothness and focality proper- 
ties by combining basis representation with penalization. The LiL2-regulari- 
zation discussed by Ou, Hamaiainen and Golland (2009) first projects B 
onto a temporal basis and then imposes the Li-penalty on the spatial do- 
main and the ^-penalty on the temporal domain. The event sparse penalty 
procedure [Bolstad, Veen and Nowak (2009)] divides the brain surface into 
several patches based on its anatomic features and uses temporal basis func- 
tions to represent source time courses within each patch. One drawback of 
both methods is that it is not straightforward to choose the basis. Both 
methods have some shortcomings. The former makes the assumption that 
the source temporal basis can be extracted perfectly from the MEG record- 
ings. The latter utilizes comprehensive prior information of the experiment 
task and the brain geometry. In addition, the use of basis representation can 
potentially cause information loss, since information orthogonal to the basis 
set can not be recovered after the projection to the basis set is done. 

The goal of this paper is to develop an innovative two-way regularization 
method (TWR) for solving the MEG inverse problem that promotes both 
spatial focality and temporal smoothness of the reconstructed signals. The 
proposed method is a two-stage procedure. The first stage produces a raw es- 
timate of B using a fast minimum norm algorithm. The second stage refines 
the raw estimate in a penalized least squares matrix decomposition frame- 
work. A sparsity-inducing penalty and a roughness penalty are employed to 
encourage spatial focality and temporal smoothness, respectively. 

The proposed TWR has three major advantages over the existing meth- 
ods. First, TWR regularizes in both spatial and temporal domains, and 
simultaneously takes into account both focality and smoothness properties. 
Hence, it should be superior to one-way regularization methods (e.g., MNE 
and MCE). Second, unlike some aforementioned spatio-temporal methods, 
TWR does not rely on the choice of basis functions. Hence, it avoids the in- 
formation loss due to basis approximation. Third, the two-stage procedure is 
computationally efficient. The advantages of our method are well illustrated 
in the empirical studies, which show clearly that TWR outperforms one-way 
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regularization methods that focus either on the focality or the smoothness 
alone, and some existing two-way spatio-temporal methods as well. 

Two-way regularization techniques for matrix reconstruction have been 
studied in other contexts. Huang, Shen and Buja (2009) present a two-way 
regularized singular value decomposition for analyzing two-way functional 
data that imposes separate roughness penalties on the two domains. Witten, 
Tibshirani and Hastie (2009) and Lee et al. (2010) develop sparse singular 
value decomposition methods that impose separate sparsity-inducing penal- 
ties on the two domains. However, to the best of our knowledge, a two-way 
regularization with the sparsity penalty on one domain and the roughness 
penalty on the other domain of the data matrix has not appeared in the 
literature. This paper provides a novel application of the two-way regular- 
ization method in solving the highly ill-posed MEG inverse problem, where 
different types of penalties are naturally used to meet the dual requirements 
of spatial focality and temporal smoothness on the unknown source signals. 

The rest of the paper is organized as follows. Section 2 presents the details 
of the TWR methodology including the computational algorithm. Through 
a synthetic example, Section 3 shows advantages of the TWR over some 
existing methods for solving the MEG inverse problem. Section 4 applies 
the TWR to a real- world MEG source reconstruction problem. Section 5 
concludes the paper with some discussion about an alternative one-step 
approach and related complications. 

2. Methodology. We propose a two-way regularization (TWR) method 
to regularize the recovered signals in both spatial and temporal domains. 
We adopt a penalized least squares formulation that uses suitable penalty 
functions to ensure the spatial focality and the temporal smoothness of the 
recovered signals. TWR is implemented in a two-stage procedure where the 
first stage produces a rough estimate of the source signals and the second 
stage refines the initial rough estimate using regularization. 

2.1. Stage 1. The goal of Stage 1 is to obtain a rough estimate of the 
location and the shape of the source signals. At this stage, source information 
in the data is retained as much as possible. It is natural to obtain such 
a rough estimate by solving the following minimization problem: 

(3) min||Y-XB||i. 

B 

Note that the forward operator X contains the information of positions 
and orientations of the dipoles, and how they are represented at the sensor 
level. This information can be decomposed by applying a singular value 
decomposition (SVD) to X, that is, X = UDV T , where U G R nxn is an 
orthogonal matrix and V £ W xn is a thin (since p> n) orthonomal matrix, 
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such that U U = UU = I and V V = I. Then the objective function in 
the optimization problem (3) becomes 

|| Y - UDV T B||| = ||U T Y - DV T B|||. 

Let Y = U T Y and C = V r B. The minimization problem (3) is equivalent 

to 

(4) min||Y-DC||i. 

B 

Let yf and cf be the ith row of Y and the ith row of C, respectively. 
Since D is a diagonal matrix, the minimization problem (4) can be obtained 
by separately solving for each i, 

min{||yi - diCi\\ 2 }, 

where d{ is the ith diagonal element in D. This problem has a unique solution 
Cj = yi/di. Then the estimated matrix C with cf in the ith row can be 
obtained. Thus, a rough estimate of B can be obtained by solving 

(5) C = V T B. 

Note that C is n x s, V is p x n, and B is p x s. Since p> n and p S> s, 
equation (5) does not have a unique solution for B. Any solution of (5) can be 
written as B^ = VC + V-'-F, where is a p x (p — n) orthonormal matrix 
whose columns are orthogonal to the columns of V and F is a (p — n) x s 
matrix. We pick the minimum norm solution, which is B = VC. In fact, B 
solves the following optimization problem: 

min||B||^ subject to ||Y — XB||_p = 0. 

B 

We can see this by noticing that ||Bt||| = ||VC||| + ||V- L F||| > ||B||| and 
the equality holds when F is a matrix of zeros. 

We call this B the raw estimate. Note that the raw estimate can only 
recover information that lies in the column space of V, and thus any infor- 
mation orthogonal to the columns of V is lost. Since the columns of V are 
the right singular vectors of X, the column space of V is equivalent to the 
row space of X. This information loss can also be understood by viewing the 
forward operator X as a filter that maps the source B to the space of the 
observations, Y, and so the information in the columns of B that is orthog- 
onal to the rows of X can not be recovered. Since all imaging methods are 
based on Model (1), information loss is a common problem to these meth- 
ods. This is the limitation of the MEG technology. Fortunately, according 
to our experience, most important information still remains in many real- 
world applications, as we will see in our real data example. Note that the 
methods that require basis representation may cause additional information 
loss, since any information in the columns of B that is orthogonal to the 
basis chosen will also be lost. 
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2.2. Stage 2. It is obvious that the raw estimate, B, can be noisy. The 
purpose of Stage 2 is to polish the raw estimate by incorporating the smooth- 
ness and focality assumptions. The polished solution from this stage is de- 
noted as B. As we will see in the simulation study in Section 3, the shapes 
of the time courses in the rows of B are noisy but usually follow the shapes 
of the true curves, and B may suggest a broader range of active regions. In 
a penalized least squares framework, we apply a roughness penalty to smooth 
the recovered time courses and apply a sparsity-inducing L\ penalty to refine 
the active regions. 

In order to apply two penalty functions to B, we first use the two- 
way structure of the raw estimate and decompose it into spatial-only and 
temporal-only components. Specifically, we write B as 

(6) B = AG T , 

where the matrix G € M sxq (q< s) contains only the temporal features of B, 
and A 6 MP xq can be treated as the spatial coefficients. When q < s, the de- 
composition (6) suggests a reduced-rank representation of B. Our empirical 
studies, however, suggest that any reduced-rank representation would lead 
to information loss and thus the full rank model is needed in practice. We 
shall focus on the full rank model (g = s) for the rest of the paper. For 
identification purposes, we require that G is an orthogonal matrix, that is, 
G T G = GG T = I. 

Note that when the full rank model is used, the reconstruction error of 
using AG T to represent B, ||B — AG T |||,, is exactly zero. We propose to 
introduce focality and smoothness requirements on A and G respectively 
at the cost of allowing some errors in reconstructing B. In particular, we 
consider the following penalized least squares problem: 

(7) min{||B - AG T |[^ + m pen^A) + /j, 2 pen 2 (G)}, 

A,G 

where pen x (A) and pen 2 (G) are appropriate penalty functions, and fii 
and /j,2 are the corresponding penalty parameters. 

To ensure the spatial focality of the recovered source signals, we employ 
a sparsity-inducing penalty on A so that the estimated A is a sparse matrix, 
that is, a large proportion of its entries are zero. Note that if a row of A 
has all zero entries, then the corresponding row of B has all zero entries, 
indicating no signal or an inactive location. Although other choices are pos- 
sible, we use the L\ penalty pen 1 (A) = |A| = Yli=i Sj=i \ a ij\ to serve our 
purpose. On the other hand, to induce smoothness to the time course of the 
recovered source signals, we apply a roughness penalty to the columns of G 
so that each column of G is a smooth function of time. Let g = (g\, . . . ,g s ) T 
represent a generic vector representing a column of G. One choice of the 
roughness penalty is the squared second order difference penalty, defined as 
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pen(g) = Yli=2 (.9l-\ ~ 2gi + 9i+i) 2 • This penalty is a quadratic form and can 
be written as g r f2g for a nonnegative definite roughness penalty matrix, ft. 
The overall penalty on G is the summation of the penalty on each column, 
pen 2 (G) = tr(G T f2G) = J^ =1 gjfigj . Using the penalties defined above, 
the penalized least squares problem (7) becomes 

(8) min{||B - AG T ||| + W |A| + ^ 2 tr(G T ftG)}. 
A,G 

2.3. Algorithm. We propose an iterative algorithm to solve (8) that al- 
ternates the optimization with respect to A and G. The algorithm starts 
with setting the initial G to be the orthonormal matrix of the right singular 
vectors from the SVD of B. That is, let B = LTR T , where L and R are 
orthonormal matrices, and we set the initial G = R. 

Fix G, update A. When G is fixed as G, the roughness penalty term 
in the objective function (8) is irrelevant to the optimization of A. Thus, 
updating A reduces to solving the problem 

(9) mm{||B-AG T ||| + /xi|A|}. 

A 

This is similar to one step of the iterative algorithm for the sparse principal 
component analysis as formulated by Shen and Huang (2008). Express AG T 
as a summation of a serial of rank-one terms 

s 

(10) AG T = ^ ai gJ, 

i=i 

where a,- and gj are the jth column of A and G, respectively. Since simul- 
taneous extracting of all the rank-one terms is computationally expensive, 
we propose to obtain them sequentially. 

For the first rank-one term (j = 1), we solve for fixed gi 

(11) min{||B-aigf||^ + Mi|ai|}. 

ai 

This problem has a closed-from solution which is given below. For the sake 
of notational simplicity, we drop the subscripts for now and express the 
objective function of (11) as 

||B - ag T |||i + /ii|a| 

(12) 

p ( s s s \ 

= H\ a iJ2^i ~ 2a i^2 b u9i + J2b 2 u + m\ai\ >, 

i=l I 1=1 1=1 1=1 J 

where bn is the (i, Z)th element in B, and a^, i = 1, . . . ,p, are the elements 
of the vector a. The minimization of (12) is equivalent to independently 
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solving p optimization problems 



(13) mmlat^2g?-2a i ^2b u g l +[xi\a i \\, i = l,...,p. 

0,1 V 1=1 1=1 / 

According to Lemma 2 of Shen and Huang (2008), the minimizer of each 
objective function in (13) is the soft thresholding rule 

(14) hi = sign(rj)(|ri| - A) + , 

where rt = Ya=i bum/ Yh=i 9i \ and A = \L\j (2 Ya=\ 9i)- The p-vector a that 
minimizes (12) is a = (a\ : . . . , a p ) T . 

After the first rank-one term a-igf is obtained, we find the second rank- 
one term by solving the following minimization problem, while fixing g 2 : 

min{||(B -aigf) -a 2 g| , ||| + /ii|a 2 |}. 

This is the same problem as (11) except that the B in (11) is replaced by the 
residual B reSi i = B — a.igj from the rank-one approximation. The rest of the 
rank-one terms, a/g^, I = 3, . . . , s, can be obtained sequentially in a similar 
manner by using the residuals from the lower-rank approximations. 

Fix A, update G. When A is fixed as A, the L\ penalty term in (8) 
becomes constant and thus is irrelevant to the optimization with respect 
to G. The update of G then solves the following problem: 

(15) min{||B - AG T ||^ + ^ tr(G T fiG)}. 

G 

Since directly solving this problem is complicated, we solve for the columns 
of G sequentially. To obtain the first column of G, we solve the problem 

(16) min{||B-aigf||! + /x 2 gfftgi}, 

gi 

which has the solution gi = (a^ail + ^2^) _1 B T ai. Let Q = PAP T be the 
eigen-decomposition. Then 

gl = P(a?ail + M 2 A)- 1 P T B T ai. 

To obtain an update of g 2 , we solve the problem 

(17) min{||(B-aigf) -a 2 g 2 r ||^ + /i 2 g 2 r f2g 2 }, 

g2 

which has the solution 

g 2 = (a 2 a 2 I + ^ 2 fJ)~ 1 (B - aigf ) T a 2 

(18) 

= P(a|a 2 I + /i 2 A)- 1 P T B^ 1 a 2 , 

where again B reSj i = B — aig^ is the residual from the rank-one approxima- 
tion. The rest of g/, I = 3, . . . , s, can be obtained similarly using the residuals 
from the corresponding lower rank approximations. When all columns of G 
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are obtained, we orthonormalize the columns of G by taking the QR decom- 
position of G and assigning the Q matrix to G. 

The iterative TWR procedure, including Stage 1 and Stage 2, is summa- 
rized in Algorithm 1. 

We consider the algorithm has converged if the Frobenius norm of the 
relative difference between the current solution and the previous solution 
is smaller than a prespecified threshold value. In our implementation, we 
declare convergence when ||Bj - Bj_i 11^/1113^ < 10" 6 . Based on our em- 
pirical studies, only a few iterations are needed to reach convergence; 15 
iterations are usually sufficient for our numerical examples in Sections 3 
and 4. 

2.4. Tuning parameters. There are two tuning parameters in the TWR 
algorithm: the focality parameter, fii, and the roughness penalty parame- 
ter, fj,2- The choice of fx\ and \xi can be done using the cross-validation (CV) 
techniques and the generalized cross-validation, respectively. 

To select fj,±, we can utilize the leave-one-out CV that minimizes the 
leave-one-out CV score defined as 

1 n 

(19) CV( Atl ) = -V;||Y i -X i A_ i G^|||, 

i=i 

where Yj is the zth row of Y corresponding to the ith time course, Xj is 
the ith row of X, and A_j and G_j are the estimates of A and G using 
all observations except the ith time course. However, practical application 
of the CV has some difficulties. The gradient-based optimization is not fea- 
sible for minimizing the CV score since it is not a smooth function of fj,\, 
a consequence of using the L\ penalty. In addition, direct computation of 
the CV score is costly because of the usual large scale of the real problem. 
In a typical MEG study, n is over 200, s is a few hundred, and p can be over 
15,000. In order to reduce the computational cost, we propose to use the 
iT-fold cross-validation. Specifically, we divide the rows of Y and X into K 
about equally sized parts and leave out one part each time for validation, 
and use the rest of the parts for estimating A and G. The i^-fold CV score 
is defined as 

1 K 

(20) CV(mi) = ^EH Y (fe) " X (fc) A_ (fc) G^ (fc) ||^, 

k=l 

where Y^ contains the kth part of the rows of Y, ~X-(k) contains the cor- 
responding rows of X, and A—(k) an d Gr/j.) are the estimates of A and G 
using all observations except the kth part of time courses that are left out 
for validation. We used K = 5 in our implementation. To further speed up 
the algorithm, we restrict our search only in a moderate-sized set of discrete 
candidate values for Such restrictive search is satisfactory, since we find 
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Algorithm 1: The TWR algorithm 



Input: X, Y, fj, x , fj, 2 , q 

Output: B 

begin 

Stage 1: 

Obtain the SVD of X: X = UDV T , D 
Y^U T Y, Y = (yf,...,£) T 
for i <— 1 to n do 



diag(di, ... ,4 



C 
B 



vc 



c T F 



Stage 2: 

Obtain the SVD of B: B 
Initialization: G R, G 
Obtain eigen-decomposition of ft: CI 
repeat 

Update A: 



LTR 

: (9jl) 



PAP 



T 



B, B res — (^res,iz) 
- G W xs 
for j ' 1 to q do 



Bros 

aogo 



B rC s 
for i 



B 



res 
Ml 



%-lgJ-l 



r 



<— 1 to p do 

frrcs, il9jl 



'J 



sign(r- ij -)(kiil - Aj) 



A<- (ay) 
Update G: 
B 

- € R pxs 
for j ^— 1 to g do 



B re s 

aogo 



B, 



^— B rcs — a ; . gj | 



g,- <- P(aJa,I + ^Ay^Bj^kj 

G<-(gl,...,gg) 

Obtain QR decomposition of G: G = QR 
G^Q 

until convergence of B <— AG T 



end 
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that the results are usually not very sensitive to mild changes of \x\ (see 
Sections 3 and 4) and thus fine tuning of \i\ is not necessary. We used 10 
different values evenly-spaced between and 1 for \x\ in our simulations and 
the real-world MEG example; the search range may need to be changed for 
different problems. 

To select /j>2, note that given A, the update of q columns of G can be 
obtained by solving q separate penalized regression problems. For the ^'th 
column, the regression has ISj cs j_ 1 &j as the input, where B resj _i = B — 

Siigf , gj as the output, and the hat matrix of the regression is Sj = 
P(aJa.jI + /i2A)~ 1 P T , according to equation (18). Theoretically, fi2 can take 
different values for different gj's, but we decide to use a common \i2 for all 
the gj's based on computational efficiency consideration. The advantage of 
this strategy is that there is only one optimization problem to solve for 
choosing the tuning parameter when updating G. Then, the overall GCV 
criterion is the average of all individual GCV criteria: 

where B reS] o = B, and tr(Sj) = X]f=i V{^y + M2^} • The GCV optimization 
is nested in the iterations because it is defined conditioning on the current 
value of A. Since the GCV criterion is a smooth function of (j>2, the op- 
timization can be done using a combination of golden section search and 
successive parabolic interpolation [Brent (1973)]. 

2.5. One-way regularization. By separately setting one of the penalty 
parameters in (8) to be zero, one can obtain two different one-way regular- 
ization methods: tOWR and sOWR, as explained below. These two one-way 
regularization methods will be used as a comparison to TWR to demonstrate 
the need for two-way regularization. 

Letting fj,± = leads to a method that emphasizes temporal smoothness 
of the recovered signals, which is referred to as tOWR (temporal one-way 
regularization), and is related to the functional PCA [Huang, Shen and Buja 
(2008)]. The corresponding optimization problem becomes 

(22) min{||B - AG T \\ 2 F + fi 2 tr(GftG T )}. 
A,G 

A modified version of Algorithm 1 can be applied for computation, with the 
"Update A" step in the algorithm simplified to A = B r G. 

Letting fi2 = leads to a method that encourages spatial sparsity of the 
recovered signals, which is referred to as sOWR (spatial one-way regulariza- 
tion) and is related to the sparse principal component analysis of Shen and 
Huang (2008). In this case, the optimization problem (8) reduces to 

(23) min{[|B-AG T ||| + /ii|A|}. 
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Fig. 1. Simulated source and sensor data. 



Again, a modified version of Algorithm 1 is applicable, but with the "Up- 
date G" step simplified to G = B T A. 

3. Synthetic example. In this section we illustrate the proposed TWR 
method using a synthetic example that mimics human brain activities. Both 
the source and the forward operator are created based on real-world MEG 
studies. 



3.1. Data generation. We generated the forward operator, X, from a human 
subject head boundary element model using the MNE software (available at: 
http : //www . nmr . mgh . harvard . edu/martinos/user Info/data/ sof MNE . php). 
The X matrix is a 248 x 15,360 matrix, corresponding to a MEG device 
with 248 valid channels. To mimic real- world scenarios and ensure enough 
difficulty of the problem, we located two source areas on the left and the 
right hemispheres, respectively. The sources were generated from two sine- 
exponential functions [Bolstad, Veen and Nowak (2009)] and are shown in 
Figure 1(a). The black solid and the red dashed curves are source signals 
located at the left motor and the right visual cortical areas, respectively. 
As we can see, the sources reach their energy peaks at 25 ms and 58 ms, 
respectively. The synthetic MEG time courses were generated using equa- 
tion (1) and were obtained using a sampling frequency 355 Hz with a dura- 
tion of 200 seconds [see Figure 1(b)]. By mimicking the real MEG data after 
preprocessing, that is, denoising and smoothing, the signal-to-noise ratio, 
SNR= ||XB|||/||E|||, is set to be 5dB. 

3.2. Comparison criteria. We compare TWR with eight different meth- 
ods that can be put into two categories as given below. 
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• One-way regularization: 

- The L 2 -based MNE method [Mosher, Leahy and Lewis (1999)] 

- The Li-based MCE method [Matsuura and Okabe (1995)] 

- sOWR (i.e., spatial sparsity only) 

- tOWR (i.e., temporal smoothness only) 

• Two-way regularization: 

- The L1L2 method proposed by Ou, Hamalainen and Golland (2009) 

- MNE+sOWR (i.e., obtaining the MNE solution as Stage 1 and then 
applying sOWR) 

- MCE+tOWR (i.e., obtaining the MCE solution as Stage 1 and then 
applying tOWR) 

- MNE+TWR (i.e., obtaining the MNE solution as Stage 1 and then 
applying Stage 2 of TWR) 

We put MNE+sOWR in the two-way regularization category because 
the L<i penalty in MNE puts constraints on both domains, and sOWR puts 
the L\ penalty only on the spatial domain. As a result, the temporal do- 
main is regularized by the L2 penalty, while the spatial domain is regularized 
first by the L2 penalty and then by the L\ penalty. Similarly, MCE+tOWR 
is also categorized as a two-way regularization method. MNE+sOWR and 
MCE+tOWR can be considered as two alternative ways for two-way reg- 
ularization and are suggested by a reviewer. MNE+TWR, also suggested 
by a reviewer, is a slight modification of TWR, replacing the first stage of 
TWR by MNE. Its inclusion in comparison helps us study the effect of using 
a different Stage 1 estimator on the performance of TWR. We implemented 
all the methods in R, and the tuning parameters are selected using either 
CV or GCV. 

Three comparison criteria are utilized: the overall mean squared error 
(MSE), the standardized distance between the energy peak of the estimated 
source and the energy peak of the true source, and the computation time. 

The overall MSE is defined as 

M££ = -||B-B|||, 
P 

where B and B are the true and recovered source matrices, respectively. 
The energy of the dipole j at time point k is defined as {tfj k x + tfj ky + 

b %,z) 1/2 ' where b jk>x , b jKy , b jKz (j = 1, . . . ,p, k = l,...,s), are the ampli- 
tude components for the jth dipole at the time point k in the Cartesian 
coordinate system. The energy of the reconstructed source can be defined 
similarly. The standardized distance between the estimated and the true 
energy peak at time point k is defined as 

, _ VK - z k y + { y * - y k y + (4=W 
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Table 1 

Comparison of nine methods using four criteria: the mean squared error (MSE), 
the standardized distance between the true energy peak and the estimated energy 
peak at the left motor area (d,2s), at the right visual area (dss), and the computation 
time (in seconds). Reported are the average and standard error of each criterion 
based on 100 simulation runs 



Method 


MSE (10- 3 ) 


d 25 (XIO" 4 ) 


eZ 58 (XIO" 4 ) 


Computation time (sec.) 


MNE 


544.0 (9.0) 


50.2 (7.3) 


42.9 (5.9) 


4371 (4.3) 


MCE 


903.7 (8.9) 


337.1 (6.4) 


156.1 (11.4) 


1545 (3.0) 


tOWR 


407.9 (8.9) 


40.2 (5.8) 


39.6 (4.3) 


1841 (3.4)* 


sOWR 


153.2 (7.7) 


19.3 (4.6) 


13.9 (3.9) 


1798 (3.6)* 


TWR 


22.3 (5.7) 


15.7 (3.3) 


7.1 (2.4) 


1872 (3.5)* 


L1L2 


44.3 (7.1) 


31.0 (6.1) 


17.8 (2.3) 


40,872 (8.8) 


MNE+sOWR 


187.3 (8.8) 


27.9 (6.8) 


14.5 (3.1) 


5998 (3.9) 


MCE+tOWR 


912.7 (10.9) 


343.8 (6.2) 


145.2 (12.7) 


3321 (3.8) 


MNE+TWR 


28.6 (7.2) 


16.9 (4.3) 


10.7 (3.9) 


6201 (3.1) 



*The computation time for each simulation run is computed based on 15 iterations, which 
are usually more than needed for algorithm convergence. 



where p/3 is the total number of dipoles, x* k , y k , z* k are the coordinates of the 
location for the maximum source energy at time point k, and x, y, z are the 
coordinates for the maximum estimated source energy at the corresponding 
time point. In this simulation example, there are two peak times, 25 ms and 
58 ms, so we are interested in c?25 and d^s- 

3.3. Results. The simulation was conducted 100 times with the noise 
term in Model (1) newly generated for each run. The criteria described in the 
previous subsection (i.e., MSE, g?25, d^s, computation time) were evaluated 
for each simulation run, and the mean and standard error of the criterion 
values across the 100 runs were calculated. The numerical results are shown 
in Table 1. 

Several interesting observations can be made from the table. TWR is the 
best method in the sense of having the smallest MSE and the shortest dis- 
tances between the true and the estimated peaks. Among the four one-way 
regularization methods, sOWR and tOWR outperform the classical MNE 
and MCE methods, and tOWR outperforms sOWR. The fact that TWR 
outperforms the four one-way regularization methods justifies our proposal 
of using two-way regularization. The L1L2 method is the third most accu- 
rate method, but its computation time is more than 21 times as large as 
that of TWR. MNE+sOWR and MCE+tOWR are less satisfactory, demon- 
strating the importance of the first stage. MNE+sOWR is not better than 
sOWR because the L2 penalty of MNE does not smooth the temporal do- 
main. The performance of MCE+tOWR is similar to MCE and is not better 
than tOWR because MCE does not recover well important information at 
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the first stage, and hence tOWR based on MCE is inaccurate. Note that the 
reported computation time for TWR, sOWR and tOWR are based on fixed 
15 iterations in order to make the calculation of the average computation 
time meaningful. Such report is conservative because these algorithms usu- 
ally converge rapidly and fewer iterations (usually less than 10) are enough 
to obtain considerably good accuracy. 

Figures 2 and 3 show the 3-D brain mapping by different methods at 
25 ms and 58 ms for a randomly selected simulation run. TWR performs 
the best among the nine methods in detecting the true source locations 
even though it misses some small regions. It is able to identify the majority 
parts of both source locations, and its solutions are focal. Solutions from 
sOWR and MNE+sOWR are more scattered than TWR. MNE and tOWR 
produce even more diffuse solutions. MCE misses the main parts of both 
active areas and so does MCE+tOWR, and they are the least satisfactory 
methods. The L1L2 method recovers some of the activity, but the solution 
is overly focal. The plot of MNE+TWR is very similar to that of TWR, so 
it is not presented here to save space. Direct comparison of results of TWR 
and tOWR clearly demonstrate the positive effect of using regularization in 
the spatial domain. 

Figures 4 and 5 show the true and the recovered time courses by the nine 
methods for an arbitrarily chosen single dipole component in the two active 
areas, respectively, for a randomly selected simulation run. Each subfigure 
shows the true time course and the estimated time course by one method. As 
one can see, the methods considering the temporal smoothness reconstruct 
the shape of the source time course well. TWR, tOWR, LiL 2 , MCE+tOWR 
and MNE+TWR all produce smooth time courses. TWR recovers the most 
energy of the source, while MCE+tOWR recovers the least. MNE+TWR 
tends to overshrink the amplitude of the time course because MNE over- 
shrinks the amplitude. The methods without considering the roughness reg- 
ularization in the temporal domain result in noisy time courses even though 
some methods can recover the general trend. In Figure 5(b), MCE does 
not capture the major peaks of the signal, and, consequently, MCE+tOWR 
[Figure 5(h)], which relies on the solution of MCE, does not recover any sig- 
nal activity either. Direct comparison of results of TWR and sOWR clearly 
demonstrate the positive effect of using regularization in the time domain. 

The selection of the focality parameter and the roughness penalty param- 
eter was conducted using the method presented in Section 2.4. Figure 6(a) 
and (b) shows the CV and GCV scores for TWR as functions of fi± and 112, 
respectively. The optimal values of the tuning parameters are \i\ = 0.33 and 
fi2 = 5.9. Figure 6(c) shows the sparsity level of the reconstructed source 
matrix, B, for TWR as a function of the number of iterations when the tun- 
ing parameters are set at the selected values. The sparsity level for a matrix 
is defined as the number of zero entries over the total number of entries. 
Here the total number of entries for B is p x s = 3,072,000. From this fig- 
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(g)LiL 2 (h)MNE+sOWR (i) MCE+tOWR 



Fig. 2. Overviews of brain mapping by different methods at 25 ms. (a) shows the true 
map, indicating an active area located at the left motor area. TWR identifies the major 
active area and the solution is focal. The L1L2 method also identifies the active area 
but the solution is too focal. sOWR and MNE+sOWR produce more scattering solutions 
than TWR. MNE and tOWR detected active areas are diffuse. MCE and MCE+tOWR 
misidentify the active region. 
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(a) Truth 



(b)MNE 



(c) MCE 






(d) TWR 



(e) sOWR 



(f) tOWR 






(g) L1I4 



(h) MNE+sOWR 



(i) MCE+tOWR 



Fig. 3. Stdeviews of brain mapping by different methods at 58 ms. (a) shows the true 
map, indicating an active area located at the right visual area. TWR and L1L2 identify 
the major active area and the solution is focal. sOWR and MNE+sOWR produce more 
scattering solutions than TWR. MNE and tOWR detected active areas are diffuse. MCE 
and MCE+tOWR misidentify the active region. 



ure, we observe that the sparsity of B levels off rather rapidly and stays 
steadily at about 0.996, a fairly high sparsity level. In fact, this sparsity 
level matches closely the true level in the simulation setup: The number of 
true source dipoles is 20, and so the total number of active source compo- 
nents is 60 after considering orientations. Thus, the true sparsity level is 
1 - 60/p = 1 - 60/15360 « 0.996. 

4. Real data example. In this section we demonstrate the proposed meth- 
od using a human MEG data set obtained from the Center for Clinical Neu- 
rosciences at the University of Texas Health Science Center at Houston. The 
study subject is a 44- year-old female patient with grade three left frontal 
astrocytoma who underwent the MEG test as part of the presurgical eval- 
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Fig. 4. Estimated time courses for one arbitrarily chosen dipole component at left mo- 
tor area by different methods for a randomly selected simulation run. TWR, tOWR, 
MCE+tOWR, L\L2 and MNE+TWR recover the shape of the time course reasonably well 
and the solutions are smooth. But MCR+tOWR, MNE+TWR and L1L2 overshrink the 
amplitude. MNE, MCE, sOWR and MNE+sOWR estimate the general trend reasonably 
well, but the estimated time courses are too noisy. TWR gives the best result. 



uation. The patient underwent a somatosensory task which is designed to 
noninvasively identify the somatosensory areas of the patient. We choose 
this study because of the clinical usefulness of the somatosensory task in 
presurgical mapping. 
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Fig. 5. Estimated time courses for one arbitrarily chosen dipole component at right visual 
area by different methods for a randomly selected simulation run. TWR, tOWR, L1L2 and 
MNE+TWR recover the shape of the time course reasonably well and the solutions are 
smooth. But MNE+TWR overshrinks the amplitude. MNE, sOWR and MNE+sOWR es- 
timate the general trend reasonably well, but the estimated time courses are too noisy. 
MCE and MCE+tOWR do not recover the shape of the time course. TWR gives the best 
result. 



Data collection was done with a whole-head neuromagnetometer contain- 
ing 248 first-order axial gradiometers. During the MEG somatosensory ses- 
sion, 558 repeated stimulations were delivered to the patient's right lower 
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tions 

Fig. 6. Selection of /ii and /12 and the sparsity level as a function of the number of 
iterations. The optimal fii and fi2 are around 0.33 and 5.9, respectively. The sparsity 
measure levels off at around 0. 996. 



lip through a pneumatically driven soft plastic diaphragm. Each stimulation 
lasted 40 ms with 450 ms epoch duration (including a prestimulus base- 
line of 100 ms) and an interstimulus interval randomized between 0.5 s and 
0.6 s. We removed the offset and averaged the 558 epochs to obtain the fi- 
nal event-related magnetic field response. Then a bad channel was removed. 
The MEG device recorded 228 time points in each epoch. The measurement 
matrix Y is 247 x 228, where n = 247 is the number of valid MEG channels 
and s = 228 is the number of recorded data points per epoch. The n x p 
forward operator X was obtained using the MNE software with p = 15,372. 

The measured MEG recordings from the 247 valid channels are plotted in 
Figure 7(a). Among the 228 time points, there are two peaks at time points 




Time Time 
(a) Measured data (b) Reconstructed source 



Fig. 7. MEG data, (a) MEG recordings from 247 valid channels; (b) Reconstructed time 
courses from an arbitrary source location in the somatosensory area by different methods. 
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85 and 99, corresponding to the activation of the primary somatosensory 
area contralateral to the stimuli, as expected by clinical experiences and 
brain anatomic theories. 

Nine methods, MNE, MCE, TWR, tOWR, sOWR, MNE+TWR, 
MNE+sOWR, MCE+tOWR and LiL 2 , were applied to solve the MEG 
inverse problem. Figure 7(b) shows the reconstructed time courses for an 
arbitrary source location by different methods. As we can see, TWR, sOWR 
and tOWR, are satisfactory in terms of estimating the shape of the source 
time course and capturing the peak features at time points 85 and 99. But 
sOWR produces a noisy time course. MNE and MNE+sOWR overshrink 
the magnitudes in addition to producing a noisy time course. MNE+TWR 
recovers the shape of the time course but underestimates the amplitude. The 
L\L<i method does not distinguish the two peaks. MCE only identifies the 
first peak but misses the second one. MCE+tOWR does not capture any 
activity because it smoothes the spikes caused by MCE and hence is the 
least satisfactory method. 

Figure 8 shows the side views of the brain mapping at time point 85 
by different methods. As we can see, the somatosensory area was correctly 
identified by TWR, which matches the clinical expectation. As with the 
synthetic example, tOWR and MNE produce diffuse solutions, leading to 
false positives around the somatosensory area. sOWR produces a scattering 
solution and so does MNE+sOWR. MNE+TWR and L X L 2 also identify 
some activity in the frontal lobe. Solutions from MCE and MCE+tOWR 
are too focal and do not cover the somatosensory area. 

Figure 9(a) shows the CV error as a function of The CV error was 
minimized when the sparsity parameter, fMi, is about 0.44. Figure 9(b) dis- 
plays the GCV error as a function of //2- It shows that the optimal fj, 2 is 
about 59.5. The sparsity level as a function of the number of iterations is 
shown in Figure 9(c). As we can see, the sparsity level increases at first and 
then levels off rapidly, indicating the algorithm converges fast. The optimal 
sparsity level was about 0.999. 

5. Discussion. TWR solves the MEG inverse problem by using two-way 
penalties that promote both the temporal smoothness and the spatial focal- 
ity of the solution. We developed a computational efficient two-stage proce- 
dure for implementing TWR. We also considered a one-stage approach that 
tries to recover the source signal matrix B = AG T by solving 

(24) min{||Y-XAG T ||J. + /zi|A| + /x 2 tr(G T fJG)}. 

A,G 

The optimal matrices A and G can be obtained by alternating optimiza- 
tion. When fixing A as A, the optimal G can be obtained as in Algo- 
rithm 1, as described in Section 2.3. When fixing G as G, the problem (24) 
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Fig. 8. Side views of the brain mapping at time point 85 by different methods. TWR 
provides a focal and accurate detection; MNE+TWR and L1L2 identify some activity in 
the frontal lobe in addition to the somatosensory area. Solutions from MNE, sOWR, tOWR 
and MNE+sOWR are too diffuse to be satisfactory. Both MCE and MCE+tOWR miss 
the activity in the somatosensory area. 



becomes 



(25) 



min{||Y-XAG T ||^ + W |A|} 

A 



= min{tr[G(YG - XA) J (YG - XA)G J ] + pi|A|} 

A 

= mm{||YG-XA|||. + ^i|A|}, 

A 

which is equivalent to s different problems, one for each column of A, namely, 



mm 



|Ygj -Xaj|| 2 + ^1^1}, 



j = l,...,s, 



where gj is the jth column of the matrix G. Each of these problems is 
a standard LASSO regression problem [Tibshirani (1996)] with over 10,000 
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Fig. 9. Selection of /ii and /12 and the sparsity level as a function of the number of 
iterations. The optimal /ii and 112 are around 0.44 an d 59.5, respectively. The sparsity 
measure levels off at around 0. 999. 

variables. Although efficient computational algorithms exist for the LASSO 
regression, the fact that the LASSO problem needs to be solved a few hun- 
dred times during each iteration of updating A makes this approach com- 
putationally unattractive. Developing a scalable algorithm for the one-stage 
approach is an important issue for its practical application and remains an 
interesting research topic. 
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